PG DS - Data Science Capstone Project - Real Estate

DESCRIPTION

Dataset Description

Variables

Project Task: Week 1

Data Import and Preparation:

Exploratory Data Analysis (EDA):

Data Import and Preparation

Import necessary libraries

Figure out the primary key and look for the requirement of indexing.
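A minimal sketch of how a primary key could be identified, assuming the data is loaded into a pandas DataFrame. The toy frame and the column name `UID` are illustrative assumptions, not taken from the actual dataset.

```python
import pandas as pd

# Toy frame standing in for the training data; "UID" is an assumed column name.
df = pd.DataFrame({
    "UID": [101, 102, 103, 104],
    "state": ["Texas", "Texas", "Ohio", "Ohio"],
    "pop": [5230, 2633, 5230, 2700],
})

def candidate_keys(frame):
    """Return the columns that are fully populated and unique on every row."""
    return [
        col for col in frame.columns
        if frame[col].notna().all() and frame[col].is_unique
    ]

keys = candidate_keys(df)

# Use the first candidate as the index so rows can be looked up by key.
indexed = df.set_index(keys[0])
```

A column that is unique and never null is a reasonable primary key candidate; setting it as the index is optional and mainly helps with lookups and joins.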

Note:

Missing value task:- Gauge the fill rate of the variables and devise plans for missing value treatment. Please explain explicitly the reason for the treatment chosen for each variable.
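The fill rate can be gauged as the share of non-null values per column. The sketch below uses a toy frame with hypothetical columns; the treatment shown (drop empty columns, impute the median) is one possible plan, not necessarily the one chosen in this project.

```python
import numpy as np
import pandas as pd

# Toy frame; column names and values are illustrative assumptions.
df = pd.DataFrame({
    "pop": [5230, 2633, 6881, 2700],
    "rent_mean": [769.4, np.nan, 742.8, 803.4],
    "BLOCKID": [np.nan, np.nan, np.nan, np.nan],
})

# Fill rate = percentage of non-null values in each column.
fill_rate = df.notna().mean() * 100

# One possible plan: drop columns that are entirely empty, then impute
# the remaining numeric gaps with the column median (robust to skew).
cleaned = df.drop(columns=fill_rate[fill_rate == 0].index)
cleaned = cleaned.fillna(cleaned.median(numeric_only=True))
```

The median is often preferred over the mean for skewed variables such as income or rent, because a few extreme values do not pull it far.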

Note:

Duplicate values check
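A short sketch of a duplicate check with pandas, on a toy frame with assumed column names:

```python
import pandas as pd

# Toy frame with one fully duplicated row (illustrative data).
df = pd.DataFrame({
    "UID": [101, 102, 102, 103],
    "pop": [5230, 2633, 2633, 6881],
})

# duplicated() flags every repeat of an earlier row.
n_duplicates = int(df.duplicated().sum())

# Keep the first occurrence of each row and renumber the index.
deduped = df.drop_duplicates().reset_index(drop=True)
```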

Descriptive Statistical analysis Train dataset

Missing value treatment

Exploratory Data Analysis

**4.Perform debt analysis. You may take the following steps:**

**4.a) Explore the top 2,500 locations where the percentage of households with a second mortgage is the highest and percent ownership is above 10 percent. Visualize using geo-map. You may keep the upper limit for the percent of households with a second mortgage to 50 percent**
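The selection described in 4.a can be sketched as a filter followed by a top-N sort. The column names `second_mortgage` and `pct_own`, and the assumption that both are stored as fractions between 0 and 1, are illustrative; the helper `top_second_mortgage` is hypothetical.

```python
import pandas as pd

# Toy data; values are assumed to be fractions (0.10 means 10 percent).
df = pd.DataFrame({
    "place": ["A", "B", "C", "D", "E"],
    "second_mortgage": [0.08, 0.60, 0.12, 0.05, 0.10],
    "pct_own": [0.55, 0.40, 0.05, 0.70, 0.30],
})

def top_second_mortgage(frame, n=2500, min_own=0.10, max_second=0.50):
    """Rows with ownership above min_own and second mortgage at most
    max_second, sorted by second mortgage share, largest first."""
    mask = (frame["pct_own"] > min_own) & (frame["second_mortgage"] <= max_second)
    return frame[mask].nlargest(n, "second_mortgage")

# n=2 here only because the toy frame is tiny; the task uses n=2500.
top = top_second_mortgage(df, n=2)
```

The resulting rows, together with their latitude and longitude columns, are what a geo-map would then plot.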

Visualize using geo-map with plotly library

**4.b)Use the following bad debt equation:**

Visualization of a bad debt

4.c) Create box and whisker plots and analyze the distribution of second mortgage, home equity, good debt, and bad debt for different cities

**Box and whisker plot of second mortgage for different cities**

Boxplot for City vs Second Mortgage

Boxplot for City vs Home Equity

Boxplot for City vs Good Debt

Boxplot for City vs Bad Debt

4.d) Create a collated income distribution chart for family income, household income, and remaining income

Distribution plots of hi_median, family_median and remaining_income

Boxplot to plot the collated income chart

NOTE: The income charts above show that both the household and the family income distributions are positively skewed (a long right tail), probably because a few states, such as Texas, California, Massachusetts, and New Jersey, combine high education levels, strong employment, and large populations. For a given geographic location, the median family income is higher than the median household income.

5. Perform EDA and come up with insights into population density and age. You may have to derive new fields (make sure to weight averages for accurate measurements):

***5.a)Use pop and ALand variables to create a new field called population density***
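A minimal sketch of the derived field, assuming `pop` is a head count and `ALand` is land area (the unit of `ALand` is not stated here, so the density is simply people per unit of `ALand`). The guard against zero area is an added assumption.

```python
import numpy as np
import pandas as pd

# Toy data; the third row has zero land area on purpose.
df = pd.DataFrame({
    "pop": [5000, 1200, 300],
    "ALand": [2_500_000, 600_000, 0],
})

# Replace zero land area with NaN so the division yields NaN instead of inf.
df["pop_density"] = df["pop"] / df["ALand"].replace(0, np.nan)
```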

***5.b)Use male_age_median, female_age_median, male_pop, and female_pop to create a new field called median age***
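One way to combine the two medians is a population-weighted average, in line with the instruction to weight averages. This is an approximation: the weighted average of two medians is not exactly the median of the combined population.

```python
import pandas as pd

# Toy data using the four column names given in the task.
df = pd.DataFrame({
    "male_age_median": [30.0, 40.0],
    "female_age_median": [34.0, 44.0],
    "male_pop": [100, 300],
    "female_pop": [300, 100],
})

total_pop = df["male_pop"] + df["female_pop"]

# Weight each group's median age by that group's population.
df["median_age"] = (
    df["male_age_median"] * df["male_pop"]
    + df["female_age_median"] * df["female_pop"]
) / total_pop
```

In the first row the female population is three times the male population, so the result (33.0) sits closer to the female median of 34 than to the male median of 30.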

***5.c)Visualize the findings using appropriate chart type***

View the population density and median age state-wise using bar plots

Plot the population density barplot

Note: From both of the plots above for population density, we observe:

Plot the Age Distribution

Note: From both of the plots above for median age, we observe:

***6.a) Create bins for population into a new variable by selecting an appropriate class interval so that the number of categories does not exceed 5, for ease of analysis.***
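A sketch of the binning step with `pandas.cut`. The bin edges and labels below are illustrative assumptions; the actual class interval should be chosen from the observed range of `pop`.

```python
import pandas as pd

df = pd.DataFrame({"pop": [500, 1800, 3200, 4700, 6100, 9000]})

# Five right-closed intervals: (0, 2000], (2000, 4000], ..., (8000, inf).
edges = [0, 2000, 4000, 6000, 8000, float("inf")]
labels = ["very low", "low", "medium", "high", "very high"]

df["pop_bins"] = pd.cut(df["pop"], bins=edges, labels=labels)
```

Fixed edges keep the brackets easy to read; `pandas.qcut` is an alternative when equally populated brackets are preferred.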

***6.b)Analyze the married, separated, and divorced population for these population brackets***
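The bracket-level analysis can be sketched as a group-by over the population bins. The bin column name `pop_bins` and the toy values are assumptions for illustration.

```python
import pandas as pd

# Toy data: marital status shares per location, with a population bracket.
df = pd.DataFrame({
    "pop_bins": ["low", "low", "high", "high"],
    "married": [0.50, 0.60, 0.40, 0.30],
    "separated": [0.02, 0.04, 0.03, 0.05],
    "divorced": [0.10, 0.12, 0.08, 0.10],
})

# Average share of each marital status within every population bracket.
summary = df.groupby("pop_bins")[["married", "separated", "divorced"]].mean()
```

The `summary` frame has one row per bracket and can be passed directly to a grouped bar chart.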

***6.c) Visualize using appropriate chart type***

***Note:***

Note:

**7.Please detail your observations for rent as a percentage of income at an overall level, and for different states**
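A hedged sketch of how rent as a percentage of income could be computed overall and per state. The column names `rent_mean` and `hi_mean` are assumptions, and the sketch assumes both are expressed over the same time period; if rent is monthly and income is annual, one of them needs rescaling first.

```python
import pandas as pd

# Toy data; column names and values are illustrative assumptions.
df = pd.DataFrame({
    "state": ["Texas", "Texas", "Ohio"],
    "rent_mean": [1000.0, 800.0, 600.0],
    "hi_mean": [50000.0, 40000.0, 40000.0],
})

# State level: average rent divided by average household income.
by_state = df.groupby("state")[["rent_mean", "hi_mean"]].mean()
by_state["rent_pct_income"] = by_state["rent_mean"] / by_state["hi_mean"] * 100

# Overall level: the same ratio over the whole dataset.
overall = df["rent_mean"].sum() / df["hi_mean"].sum() * 100
```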

**Let's visualize the median household income of states**

Note: From the above state-wise distribution of rent as a percentage of household income, we see that states such as Florida, Virginia, Georgia, and Texas have a higher rent percentage of income.

8.Perform correlation analysis for all the relevant variables by creating a heatmap. Describe your findings.
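The correlation matrix behind the heatmap can be computed with `DataFrame.corr()`. The columns in this toy frame are assumptions chosen to show one strongly positive and one strongly negative relationship.

```python
import pandas as pd

# Toy data: rent rises with income, second mortgage share falls with it.
df = pd.DataFrame({
    "hi_mean": [40.0, 50.0, 60.0, 70.0],
    "rent_mean": [8.0, 10.0, 12.0, 14.0],
    "second_mortgage": [0.09, 0.07, 0.05, 0.03],
})

# Pearson correlation between every pair of numeric columns.
corr = df.corr()

# The matrix can then be drawn as a heatmap, for example with
# seaborn.heatmap(corr, annot=True), if seaborn is installed.
```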

NOTE:

***8. Correlation analysis of all the relevant variables and findings from thereon***

Note:

Project Task: Week 2

***Data Pre-processing:***

Adequacy test to evaluate the 'factorability' of a dataset using the KMO (Kaiser-Meyer-Olkin) test and Bartlett's sphericity test
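To make the two adequacy tests concrete, here is a from-scratch NumPy sketch on synthetic data. It is not the library routine used in the notebook; the function names `bartlett_statistic` and `kmo_overall` are hypothetical, and the p-value lookup for the chi-square statistic is left out.

```python
import numpy as np

def bartlett_statistic(data):
    """Bartlett's sphericity chi-square statistic and degrees of freedom."""
    n, p = data.shape
    corr = np.corrcoef(data, rowvar=False)
    _, logdet = np.linalg.slogdet(corr)
    chi2 = -(n - 1 - (2 * p + 5) / 6) * logdet
    dof = p * (p - 1) / 2
    return chi2, dof

def kmo_overall(data):
    """Overall KMO: squared correlations relative to squared correlations
    plus squared partial correlations (off-diagonal entries only)."""
    corr = np.corrcoef(data, rowvar=False)
    inv = np.linalg.inv(corr)
    scale = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    partial = -inv / scale
    off = ~np.eye(corr.shape[0], dtype=bool)
    r2 = (corr[off] ** 2).sum()
    a2 = (partial[off] ** 2).sum()
    return r2 / (r2 + a2)

# Synthetic data: four variables driven by one shared latent factor.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
data = latent + 0.5 * rng.normal(size=(200, 4))

chi2, dof = bartlett_statistic(data)
kmo = kmo_overall(data)
```

A large Bartlett statistic relative to its degrees of freedom suggests the correlation matrix is not an identity matrix, and a KMO above roughly 0.6 is a commonly used rule of thumb for adequate factorability.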

Note:

Exploratory factor analysis

Note:

Note: fa_loadings: We have 30 columns and 10 factors (latent variables), so the factor loadings matrix has 30 rows and 10 columns.

Compute eigenvalues: the variance explained by a particular factor out of the total variance
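As an illustration of the idea, the eigenvalues of the correlation matrix can be computed directly with NumPy. This is a sketch on synthetic data, not the factor analysis routine itself; the Kaiser rule (eigenvalue above 1) shown at the end is a common heuristic and an assumption here.

```python
import numpy as np

# Synthetic data: three variables share one latent factor, two are noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 1))
data = np.hstack([
    latent + 0.5 * rng.normal(size=(300, 3)),
    rng.normal(size=(300, 2)),
])

corr = np.corrcoef(data, rowvar=False)

# eigvalsh suits symmetric matrices; sort from largest to smallest.
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]

# Share of total variance attributed to each eigenvalue.
explained = eigenvalues / eigenvalues.sum()

# Kaiser rule: count the eigenvalues greater than 1.
n_factors = int((eigenvalues > 1).sum())
```

The eigenvalues of a correlation matrix sum to the number of variables, so each one can be read as "how many variables' worth" of variance that factor carries. Plotting `eigenvalues` against their rank gives the scree plot.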

Plot the screeplot

Note: A scree plot is a graphical tool used in the selection of the number of relevant components or factors to be considered in a factor analysis.

Note: Based on the above scree plot, let us choose 10 factors, as the elbow is observed at 10.

Note:

From the above factor loadings dataframe we can infer which features have the strongest loadings. Several factors appear to have weak loadings.

From the above, we can name each factor after the highly correlated features that group within it, and keep only those factors with little or no correlation among themselves, so that we work with fewer dimensions in a more meaningful way.

Project Task: Week 3

Data Modeling :

****Build a linear regression model to predict the total monthly expenditure for home mortgage loans****

Please refer to deplotment_RE.xlsx. Column hc_mortgage_mean is the predicted variable. This is the mean monthly mortgage and owner costs of the specified geographical location.

Note:

a) Exclude loans from prediction model which have NaN (Not a Number) values for hc_mortgage_mean.

b) Run a model at the national level. If the accuracy and R-squared are not satisfactory, proceed to the step below.

c) Run another model at the state level. The dataset has 52 distinct state values (the 50 US states plus, presumably, the District of Columbia and Puerto Rico).

d) Keep the following considerations in mind while building a linear regression model:

* Variables should have significant impact on predicting Monthly mortgage and owner costs

* Utilize all predictor variables to start with the initial hypothesis

* R square of 60 percent and above should be achieved

* Ensure multicollinearity does not exist among the independent (predictor) variables

* Test if predicted variable is normally distributed
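The multicollinearity consideration in the list above can be checked with variance inflation factors (VIF). The sketch below computes them from the inverse of the predictors' correlation matrix using NumPy on synthetic data; the function name `vif` is hypothetical, and the cut-off of 10 mentioned afterwards is a rule of thumb, not a project requirement.

```python
import numpy as np

def vif(X):
    """VIF of each column: the diagonal of the inverse correlation matrix,
    which equals 1 / (1 - R^2) from regressing that column on the others."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

# Synthetic predictors: c is almost a copy of a, b is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
X = np.column_stack([a, b, c])

scores = vif(X)
```

Columns with a VIF well above 10 (here `a` and `c`) are strongly explained by the other predictors, so one of each such pair is usually dropped or combined before fitting.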

Note: Since the features in our dataset vary widely in scale, we will use a standard scaler to scale the data to zero mean and unit variance.
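What standard scaling does can be shown in a few lines of NumPy. This sketch uses made-up values; scikit-learn's `StandardScaler` applies the same transformation with the population standard deviation.

```python
import numpy as np

# Two features on very different scales (illustrative values).
X = np.array([
    [1000.0, 0.1],
    [2000.0, 0.2],
    [3000.0, 0.6],
])

mean = X.mean(axis=0)
std = X.std(axis=0)  # population standard deviation (ddof=0)

# Subtract each column's mean and divide by its standard deviation.
X_scaled = (X - mean) / std
```

In a train/test setting the mean and standard deviation should be computed on the training data only and then reused for the test data.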

Note:

Distribution of a predicted variable

Check the residuals
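A minimal sketch of fitting a linear model and inspecting its residuals, using NumPy least squares on synthetic data. The coefficients and noise level are invented for illustration and do not reflect the project's results.

```python
import numpy as np

# Synthetic data: target depends linearly on two scaled predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1500 + 300 * X[:, 0] - 120 * X[:, 1] + rng.normal(scale=20, size=200)

# Add an intercept column and solve the least-squares problem.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

pred = A @ coef
residuals = y - pred

# R-squared: share of the target's variance explained by the model.
r2 = 1 - (residuals ** 2).sum() / ((y - y.mean()) ** 2).sum()
```

With an intercept in the model the residuals average to zero by construction, so the useful checks are their spread, any pattern against the predictions, and their distribution.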

Plot the residuals

Test for normality of errors using QQ-plot

Note: The plot above compares the theoretical quantiles of a standard normal distribution with the quantiles of the standardized sample residuals. The points fall close to the 45-degree line, so we can conclude that the errors are approximately normally distributed.
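The quantities behind a QQ-plot can be computed with the Python standard library alone. This sketch uses simulated residuals and the plotting positions (i + 0.5) / n, which is one common convention and an assumption here; the helper `pearson` is hypothetical.

```python
import random
from statistics import NormalDist, mean, pstdev

# Simulated residuals standing in for the model's residuals.
random.seed(0)
residuals = [random.gauss(0, 25) for _ in range(500)]

# Standardize and sort the residuals to get the sample quantiles.
mu, sigma = mean(residuals), pstdev(residuals)
sample_q = sorted((r - mu) / sigma for r in residuals)

# Matching quantiles of the standard normal distribution.
n = len(sample_q)
theoretical_q = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]

def pearson(xs, ys):
    """Pearson correlation between two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Close to 1 when the points lie near a straight line.
qq_corr = pearson(theoretical_q, sample_q)
```

Plotting `theoretical_q` against `sample_q` reproduces the QQ-plot; the correlation is a rough numeric summary of how straight that line is.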

Note:

Predicting the total monthly expenditures for home mortgages loan at state level
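A sketch of the state-level approach: drop rows with a missing target, then fit one simple model per state. The single predictor `hi_mean` and the toy values are assumptions; the real model would use the selected predictors or factors.

```python
import numpy as np
import pandas as pd

# Toy data with one missing target value.
df = pd.DataFrame({
    "state": ["Texas"] * 4 + ["Ohio"] * 4,
    "hi_mean": [40.0, 50.0, 60.0, 70.0, 40.0, 50.0, 60.0, 70.0],
    "hc_mortgage_mean": [1000.0, 1200.0, np.nan, 1600.0,
                         900.0, 1000.0, 1100.0, 1200.0],
})

# Exclude rows whose target hc_mortgage_mean is NaN.
model_df = df.dropna(subset=["hc_mortgage_mean"])

# Fit a straight line per state; polyfit returns (slope, intercept).
per_state = {
    state: np.polyfit(group["hi_mean"], group["hc_mortgage_mean"], deg=1)
    for state, group in model_df.groupby("state")
}
```

Fitting per state lets the slope and intercept differ by state, at the cost of fewer rows per model.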

Project Task: Week 4

Data Reporting:

Create a dashboard in Tableau by choosing appropriate chart types and metrics useful for the business. The dashboard must include the following:

Box plot of distribution of average rent by type of place (village, urban, town, etc.).

Pie charts to show overall debt and bad debt.

Explore the top 2,500 locations where the percentage of households with a second mortgage is the highest and percent ownership is above 10 percent. Visualize using geo-map.

Heat map for correlation matrix.

Pie chart to show the population distribution across different types of places (village, urban, town etc.).

https://public.tableau.com/app/profile/amruta/viz/RealEstate_16556405910430/RealEstateDashboard

END